The dataset used in the laboratory 2 assignment is from Kaggle and about bikesharing in London, UK. Only the training dataset is used because we are only interested in the observations that have the dependent variable cnt. The data is from 2011 and 2012. The launch of the bikesharing company was in 2010. Several features about the weather conditions and the time in year are provided. Since we have variables indicating the year, season, month, weekday and hour, we can delete the timestamp variable. Also, the variable ID is deleted because in R, the program I am doing this report with, is indexing the rows anyway.
In addition to the weather and time variables, the variable cnt indicates the count of new bike shares. It is grouped by start time of the share per hour, i.e. the count of new bike shares by hour. The long duration shares are not taken in the count. cnt is used as the dependent variable and the mean is 188.4 (red line in the plot). Summed up, this makes 13 variables.
In total, we have 8690 observations, i.e. a count of new bike shares in this particular hour plus the corresponding weather information. The data is complete, we have no missing values.
As the bikesharing data was collected soon after the company launched the program, the number of new bike shares per hour is increasing, probably due to an increased popularity of the brand. In the plot below, one can clearly see the difference between 2011 and 2012. Whereas in 2011 most of the counts per hour were close to 0, this tendency decreased in 2012.
I discovered the difference by plotting the density of the new bikeshares variable cnt and grouping the data based on the other variables.
For the analysis of how the features influence the frequency of new bikeshares per hour, a lasso regression will be used. The lasso regression puts only weak assumptions on the data which makes it a versatile and flexible technique. In addition, lasso constraints the coefficients individually which avoids overfitting and leads to a feature selection to simplify the model. For the model selection, we can either use the minimum error criterion which simply reduces the MSE to a minimum or the most regularized model which is a simpler model and still within 1 standard deviation from the error minimum. In practise, the latter criterion is often preferred.
In the plot above one can observe the prediction error. The black dots represent the actual data plotted against the predicted data of the bikeshare count per hour. The red line is the regression line which resembles the perfect fit between prediction and actual results. Since there is spread around it, our predictions is uncertain which is reflected by the prediction error.
| 3-feature model | Most regularized | Minimum error | |
|---|---|---|---|
| (Intercept) | 106.56 | -30 | -111.15 |
| season | excluded | 9.11 | 19.06 |
| yr | excluded | 58.28 | 80.02 |
| mnth | excluded | excluded | 0.28 |
| hr | 3.11 | 6.45 | 7.59 |
| holiday | excluded | excluded | -13.37 |
| weekday | excluded | excluded | 2.10 |
| workingday | excluded | excluded | 6.00 |
| weathersit | excluded | excluded | -2.71 |
| temp | 121.62 | 98.15 | 40.54 |
| atemp | excluded | 169.04 | 268.00 |
| hum | -23.31 | -153.1 | -193.63 |
| windspeed | excluded | excluded | 49.66 |
The table above provides us with the information about the lasso regression model, once with the option to choose the parameter to create the most regularized model, i.e. the model with the lowest number of coefficients while keeping the prediction error low. The second model reduces the prediction error to a minimum and also includes variables which have a low but existing effect on the dependent variable cnt. Lastly, the 3-feature model contains the three most influential variables. In the following, we will use the most regularized model with six variables.
The number of new bikeshares is mainly influenced by 1. the feeled temperature atemp, 2. humidity hum and 3. the particular hour hr. Season, year and temperature are also in the model, however, their influence is weaker.
I discovered the main influential factors with the lasso regression presented under “Discovery” in “Insight No. 1”.
This plots includes only features which were tested to have a significant influence on the count of new bikeshares per hour.
From late in the evening around 11pm to 5am, the number of bikeshares is rather low on most of the days, i.e. the variance is rather small. This indicates at night constantly only a few bikeshares are made. In contrast, more new bikeshares are made during the day with a high variance in the exact number per day, especially during 8am and 5pm. During the day, the influence of the other factors influencing the number of new bikeshares explained in “Insight No. 2” seems to be higher (temperature, humidity etc.).
This discovery is also based on the regression where hour was exposed as an important factor and I wanted to have a closer look at the association between daytime and the number of new bikeshares. The following ridgeline plot shows that in more detail.
## Picking joint bandwidth of 27.5
All of variables I have shown in a certain context have proven to be statistically significant when looking at the lasso regression analysis. However, we need to keep in mind that statistical significance does not neccissarily mean practical relevance. For instance, the first insight where the year 2011 and 2012 are compared clearly shows different patterns for each year, though year has only a minor influence on the count of new bikeshares. If penalizing the model enough through the lasso constraint, the variable year is removed even before temperature. This is indeed proof for the low relevance since temperature temp and feeled temperature atemp are highly correlated (r = 0.99) and those two variables stay in the model together longer than year.
The first chart is about the year difference in the distribution of the count of new bikeshares per hour, thus a grouped density plot.
The second plot is a simple bar chart in which each of the features (in total: 6) of the most regularized statistical model is represented by a bar. The part’s size implies the importance or the weight of the variable in the model. The weight is calculated by the coefficient times the values, the independent variable can take (e.g., hour: 3.11 * 23 since hour is max. 23).
Note that one can only get the coefficient weights when calculating the lasso regression which is possible in R and Python but not in the applications. For the sake of visualization, I added the coefficient weights manually to the data and visualized them with the apps.
The feature hour is an essential explanatory variable in our model and in this plot, we can see why that is. The day time strongly influences the number of new bikeshares with peaks at around 8am and 5-6pm which one can clearly observe in the ridgeline plot.
I try to select libraries and applications which I might encounter in the future during my studies or on the job. For example, in an internship I will start soon, I have to work with Tableau.
Libraries:
1. ggplot (R).
2. R basic charting.
3. Python.
Applications:
4. Datawrapper.
5. Tableau.
6. Excel.
## Picking joint bandwidth of 27.5
A ridgeline plot could be created, however, one could create 24 individual density plots. One where the day hour is 17 was created for illustration.
All plots could be reproduced, however, the ridgeline plot has thick horizontal base lines which make it difficult to observe the densities.
Datawrapper could not handle densities which is why two of three plots could not be recreated.
Distribution per Year; Distribution per Hour; Coefficient Weights
Tableau was able to recreate the plots, the density plots in a somewhat adjusted way. The densities are not relative but absolute frequency distributions. The ridgeline plot is not really a ridgeline plot but the 24 distributions in one plot. Because of the interactiveness, one can click on each of the day hours to observe the cnt distribution.
Excel could also not handle densities but represented the distributions by absolute frequency distributions. Excel has a somewhat different way of visualizing the frequencies because I had to create bins to have a smooth line in the plot. When counting the frequencies of
cnt with bin size 1, each outlier would result in a spike in the diagram and within the data are hundreds of those outliers which would make the chart unreadable. For the ridgeline plot, spark lines were used to have a similar chart to the ridgeline plot. In principal, it worked but unfortunately, there is no x-axis with the values from 0 to 1000.
The grade scale goes from 1 (bad/not supported) to 10 (great/highly supported). Every grading process is shortly described.
| Tool | Data Handling | User Guidance | Feature Richness | Innovation | Flexibility | Interaction |
|---|---|---|---|---|---|---|
| ggplot | 9 | 7 | 8 | 8 | 9 | 1 |
ggplot handles data very well since ggplot is used in the R environment which is basically build for data manipulation and is able to load very different formats. The documentation is good but there is not one comprehensible manual which is able to easily explain all its possibilities. However, using search engines, one can quickly find help. The lack of one manual might be caused in its feature richness and its expandability. The components in ggplot are like blocks which are literally added on top of each other (e.g., ggplot(data, aesthetics + ggtitle(“…”) + geom_bar(…))). This way, a vast range of innovative plots can be created. The interaction is non-existent in ggplot itself. Other packages like plotly rely on ggplot as the visualization tool and enable interaction with the graphic.
| Tool | Data Handling | User Guidance | Feature Richness | Innovation | Flexibility | Interaction |
|---|---|---|---|---|---|---|
| R | 9 | 8 | 5 | 3 | 7 | 1 |
The basic R plotting tools work similarly to ggplot but are limited to the basics in plotting. Also, the design is very basic. More advanced graphs like the ridgeline plots are not possible. However, we could plot each plot per hour individually which I did once as an example. The user guidance is very good which is also caused in the simplicity of features the basic R has. Like ggplot the basic R visualization methods are also relatively flexible but again, no interaction is possible.
| Tool | Data Handling | User Guidance | Feature Richness | Innovation | Flexibility | Interaction |
|---|---|---|---|---|---|---|
| Python | 6 | 8 | 9 | 8 | 5 | 1 |
When talking about using Python, this will imply that I used modules because Python on its own has no visualizations implemented. With pandas, a data manipulation module in Python, one can handle data well although it is not as intuitive as a dataframe or matrix in R. In addition, I needed to transform the data to make use of the Python module which was rather complicated. The user guidance is very well organized, though it depends on the packages used. There are plenty of features one can use, however, many are limited within their own packages. For example, one can use the features of some package on the plots produced by the package but combining features of different packages is rarely possible, i.e. the packages are usually flexible themselves but not in combination. Interactive plots can be built with dashy but require a lot of extra work which goes beyond the scope of the classical Python user. Thus it is rated like the R tools as “not supported”. Note that in the ridgeline plot, the densities are better recognizable in the ggtool-plot compared to the Python plot which is definitely an advantage of ggplot but is not really represented in the rating.
| Tool | Data Handling | User Guidance | Feature Richness | Innovation | Flexibility | Interaction |
|---|---|---|---|---|---|---|
| Datawrapper | 2 | 6 | 4 | 2 | 5 | 1 |
Besides a comprehensive importing tool which recognizes most of the common data formats and types automatically, one cannot manipulate the data besides determining string/numeric data types and selecting the first row as the header row. The user guidance is satisfactory. One is well guided by the company’s guidance but there is not much support from a community such as Stack Overflow for Java, Python, R etc. The range of plots is limited to the demands of journals and newspapers since most of users seem to be newspapers such as New York Times, Süddeutsche and Zeit. The demands of those mostly cover extensive bar, pie and line plots, sometimes also maps. The application struggled a lot with frequencies and densities. It was not possible to do any kind of density plot, not even a histogram could be configured by manipulating the data into bins. There was only a histogram in the import phase when clicking on each variable and checking the data format (see the first plot of datawrapper above the distribution of cnt). Thus the feature richness is limited to basic plots one could expect in a newspaper but in this regard, the options are quite plentyful. Innovation is in consequence quite weak, flexibility within the existing plots satisfactory and there are no interaction possibilites.
| Tool | Data Handling | User Guidance | Feature Richness | Innovation | Flexibility | Interaction |
|---|---|---|---|---|---|---|
| Tableau | 3 | 6 | 7 | 5 | 7 | 8 |
Tableau has similar import features like Datawrapper but can handle more file formats plus the data type settings are more advanced. Tableau has a large community and a proper guide by the developers. In addition, it does not appear to be limited to one particular type of plots. It offers a lot of features and options to build custom plots. This way, a range of more innovative plots can be created in a flexible manner. When it gets more complex, the user guidance is however not as strong as for more common plots. The density plots are rather distributional plots which visualize more or less the same properties but in absolut numbers, not in relative frequencies. I also had to adjust the representation of cnt by year and hour in a way that one can interactively click on the years and hours with detailed information popping up. There was no possibility of creating a ridgeline plot except by doing it customly which requires a lot more experience than I have at the moment. The interactivity is a great strength of Tableau. Every plot is automatically interactive and custom interactions can be added.
| Tool | Data Handling | User Guidance | Feature Richness | Innovation | Flexibility | Interaction |
|---|---|---|---|---|---|---|
| Excel | 7 | 9 | 7 | 4 | 6 | 1 |
Excel has its pros and cons when it comes to data handling. On the one hand, excel is quite stubborn when it comes to data types. One has to make sure that the data is classified correctly, i.e. one has to determine whether it is numeric and within the numerics, it is even necessary to give information about whether it is a count, sum, average etc. On the other hand, working with the Pivot tables is quite convenient because one can easily set columns and rows and filter on those variables. Although it is quite a hussle to have the data correctly organized for its purposes, the possibilities are enormous. As Excel is a popular software, there is a lot of guidance online. This way, the actually quite limited standard feature selection given by Excel can be extended to a lot of new charts. However, there is a limit to innovative plots. Some, for example the ridgeline plot, can be realized by top-class excel users but are very complicated for normal users. The richness of settings paired with the online user guidance forums leads to a high flexibility, having in mind that it takes a certain level of skill. Lastly, interactions are not possible. One can however dynamically change the static plots by checking and unchecking boxes to rearrange the data selection for the plots.
| Tool | Data Handling | User Guidance | Feature Richness | Innovation | Flexibility | Interaction |
|---|---|---|---|---|---|---|
| ggplot | 9 | 7 | 9 | 8 | 9 | 1 |
| R | 9 | 8 | 5 | 3 | 7 | 1 |
| Python | 6 | 8 | 9 | 8 | 5 | 1 |
| Datawrapper | 2 | 6 | 4 | 2 | 5 | 1 |
| Tableau | 3 | 6 | 7 | 5 | 7 | 8 |
| Excel | 7 | 9 | 7 | 4 | 6 | 1 |